SeqHBase: a big data toolset for family based sequencing data analysis

Authors

  • Min He
  • Thomas N Person
  • Scott J Hebbring
  • Ethan Heinzen
  • Zhan Ye
  • Steven J Schrodi
  • Elizabeth W McPherson
  • Simon M Lin
  • Peggy L Peissig
  • Murray H Brilliant
  • Jason O'Rawe
  • Reid J Robison
  • Gholson J Lyon
  • Kai Wang
Abstract

BACKGROUND Whole-genome sequencing (WGS) and whole-exome sequencing (WES) technologies are increasingly used to identify disease-contributing mutations in human genomic studies. Processing such data can be a significant challenge, especially when a large family or cohort is sequenced. Our objective was to develop a big data toolset that efficiently manipulates genome-wide variants, functional annotations and coverage, and that supports family based sequencing data analysis.

METHODS Hadoop is a framework for reliable, scalable, distributed processing of large data sets using MapReduce programming models. Based on Hadoop and HBase, we developed SeqHBase, a big data-based toolset for analysing family based sequencing data to detect de novo, inherited homozygous, or compound heterozygous mutations that may contribute to disease manifestations. SeqHBase takes as input BAM files (for coverage at every site), variant call format (VCF) files (for variant calls) and functional annotations (for variant prioritisation).

RESULTS We applied SeqHBase to a 5-member nuclear family and a 10-member 3-generation family with WGS data, as well as a 4-member nuclear family with WES data. Analysis times scaled almost linearly with the number of data nodes. With 20 data nodes, SeqHBase took about 5 seconds to analyse WES familial data and approximately 1 minute to analyse WGS familial data.

CONCLUSIONS These results demonstrate SeqHBase's high efficiency and scalability, which are necessary as WGS and WES are rapidly becoming standard methods to study the genetics of familial disorders.
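The three inheritance patterns named in the abstract can be illustrated with a minimal sketch for a single parent-child trio. This is not SeqHBase's implementation: it does not use Hadoop or HBase, it ignores coverage and call quality, and the genotype encoding, function names, record layout and gene labels are all assumptions made for illustration.

```python
from collections import defaultdict

# Assumed encoding: genotype = number of alternate alleles
# (0 = homozygous reference, 1 = heterozygous, 2 = homozygous alternate).


def is_de_novo(child, father, mother):
    """Child carries an alternate allele that neither parent carries."""
    return child > 0 and father == 0 and mother == 0


def is_inherited_homozygous(child, father, mother):
    """Child is homozygous alternate and both parents are heterozygous carriers."""
    return child == 2 and father == 1 and mother == 1


def compound_het_genes(variants):
    """Find genes with at least one child-heterozygous variant inherited
    only from the father and at least one inherited only from the mother.

    Each variant is a dict with hypothetical keys:
    'id', 'gene', 'child', 'father', 'mother'.
    """
    paternal = defaultdict(list)
    maternal = defaultdict(list)
    for v in variants:
        if v["child"] != 1:
            continue  # only heterozygous calls in the child are candidates
        if v["father"] == 1 and v["mother"] == 0:
            paternal[v["gene"]].append(v["id"])
        elif v["father"] == 0 and v["mother"] == 1:
            maternal[v["gene"]].append(v["id"])
    # Keep genes hit on both the paternal and the maternal side.
    return {g: (paternal[g], maternal[g]) for g in paternal if g in maternal}


# Illustrative, made-up trio calls (not data from the study).
variants = [
    {"id": "v1", "gene": "GENE_A", "child": 1, "father": 0, "mother": 0},
    {"id": "v2", "gene": "GENE_B", "child": 2, "father": 1, "mother": 1},
    {"id": "v3", "gene": "GENE_C", "child": 1, "father": 1, "mother": 0},
    {"id": "v4", "gene": "GENE_C", "child": 1, "father": 0, "mother": 1},
]

de_novo_ids = [v["id"] for v in variants
               if is_de_novo(v["child"], v["father"], v["mother"])]
homozygous_ids = [v["id"] for v in variants
                  if is_inherited_homozygous(v["child"], v["father"], v["mother"])]
compound_het = compound_het_genes(variants)
```

In a real analysis these genotype checks would typically be combined with per-site coverage (the role the abstract assigns to the BAM input) so that a parent's apparent reference genotype is not just a region with too few reads.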


Similar resources

A High-performance Computing Toolset for Big Data Analysis of Genome-Wide Variants

3 Features
  3.1 Features of CoreArray
  3.2 Features of SNPRelate for SNP Data
    3.2.1 Data Structure for SNPRelate
    3.2.2 Functions of SNPRelate
  3.3 Features of SeqArray for Sequencing Data ...


ONETOOL for the analysis of family-based big data.

Motivation Despite the need for separate tools to analyze family-based data, there are only a handful of tools optimized for family-based big data compared to the number of tools available for analyzing population-based data. Results ONETOOL implements the properties of well-known existing family data analysis tools and recently developed methods in a computationally efficient manner, and so ...


2016 Olympic Games on Twitter: Sentiment Analysis of Sports Fans Tweets using Big Data Framework

Big data analytics is one of the most important subjects in computer science. Today, due to the increasing expansion of Web technology, a large amount of data is available to researchers. Extracting information from these data is one of the requirements for many organizations and business centers. In recent years, the massive amount of Twitter's social networking data has become a platform for ...


Design and Test of the Real-time Text mining dashboard for Twitter

One of today's major research trends in the field of information systems is the discovery of implicit knowledge hidden in dataset that is currently being produced at high speed, large volumes and with a wide variety of formats. Data with such features is called big data. Extracting, processing, and visualizing the huge amount of data, today has become one of the concerns of data science scholar...


Big Data Analytics for Genomic Medicine

Genomic medicine attempts to build individualized strategies for diagnostic or therapeutic decision-making by utilizing patients' genomic information. Big Data analytics uncovers hidden patterns, unknown correlations, and other insights through examining large-scale various data sets. While integration and manipulation of diverse genomic data and comprehensive electronic health records (EHRs) o...



Journal title:

Volume 52  Issue 

Pages  -

Publication date 2015